Dropout Training for Support Vector Machines

Authors

  • Ning Chen
  • Jun Zhu
  • Jianfei Chen
  • Bo Zhang
Abstract

Dropout and other feature noising schemes have shown promising results in controlling over-fitting by artificially corrupting the training data. Though extensive theoretical and empirical studies have been performed for generalized linear models, little work has been done for support vector machines (SVMs), one of the most successful approaches to supervised learning. This paper presents dropout training for linear SVMs. To deal with the intractable expectation of the non-smooth hinge loss under corrupting distributions, we develop an iteratively re-weighted least squares (IRLS) algorithm by exploring data augmentation techniques. Our algorithm iteratively minimizes the expectation of a re-weighted least squares problem, where the re-weights have closed-form solutions. Similar ideas are applied to develop a new IRLS algorithm for the expected logistic loss under corrupting distributions. Our algorithms offer insights into the connection and difference between the hinge loss and the logistic loss in dropout training. Empirical results on several real datasets demonstrate the effectiveness of dropout training in significantly boosting the classification accuracy of linear SVMs.

Introduction

Artificial feature noising augments the finite training data with an infinite number of corrupted versions, by corrupting the given training examples with a fixed noise distribution. Among the many noising schemes, dropout training (Hinton et al. 2012) is an effective way to control over-fitting by randomly omitting subsets of features at each iteration of a training procedure. By formulating feature noising methods as minimizing the expectation of some loss function under the corrupting distribution, recent work has provided a theoretical understanding of such schemes from the perspective of adaptive regularization (Wager, Wang, and Liang 2013), and has shown promising empirical results in various applications, including document classification (van der Maaten et al. 2013; Wager, Wang, and Liang 2013), named entity recognition (Wang et al. 2013), and image classification (Wang and Manning 2013). Regarding the loss functions, though much work has been done on the quadratic loss, the logistic loss, or the log-loss induced from a generalized linear model (GLM) (van der Maaten et al. 2013; Wager, Wang, and Liang 2013; Wang et al. 2013), little work has been done on the margin-based hinge loss underlying the very successful support vector machines (SVMs) (Vapnik 1995). One technical challenge is that the non-smoothness of the hinge loss makes it hard to compute, or even approximate, its expectation under a given corrupting distribution. Existing methods are not directly applicable, calling for new solutions. This paper attempts to address this challenge and fill this gap by extending dropout training, as well as other feature noising schemes, to support vector machines.

Previous efforts on learning SVMs with feature noising have been devoted to either explicit corruption or an adversarial worst-case analysis. For example, virtual support vector machines (Burges and Scholkopf 1997) explicitly augment the training data (usually the support vectors from previous learning iterations, for computational efficiency) with additional examples that are corrupted through some invariant transformation models. A standard SVM is then learned on the corrupted data.
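To make this explicit-corruption baseline concrete, the following sketch replicates each training example several times under dropout noise and then fits an ordinary linear SVM on the augmented set. It is only an illustration under assumed choices: NumPy and scikit-learn's LinearSVC are used as the libraries, and the helper name dropout_corrupt, the dropout probability delta, and the number of copies k are hypothetical names, not taken from the paper.

    # Explicit-corruption baseline (illustrative sketch, not the paper's method):
    # replicate every example k times, zero out each feature independently with
    # probability delta, then train a standard linear SVM on the augmented data.
    import numpy as np
    from sklearn.svm import LinearSVC

    def dropout_corrupt(X, delta, k, rng):
        """Return k dropout-corrupted copies of every row of X."""
        X_rep = np.repeat(X, k, axis=0)
        keep = rng.random(X_rep.shape) >= delta   # keep each feature w.p. 1 - delta
        return X_rep * keep

    rng = np.random.default_rng(0)

    # Toy data: two Gaussian blobs in 20 dimensions with labels +1 / -1.
    n, d = 200, 20
    X = np.vstack([rng.normal(+1.0, 1.0, size=(n, d)),
                   rng.normal(-1.0, 1.0, size=(n, d))])
    y = np.concatenate([np.ones(n), -np.ones(n)])

    delta, k = 0.5, 10                          # dropout probability, copies per example
    X_aug = dropout_corrupt(X, delta, k, rng)
    y_aug = np.repeat(y, k)

    clf = LinearSVC(C=1.0).fit(X_aug, y_aug)    # standard SVM on the corrupted copies
    print("accuracy on the clean training data:", clf.score(X, y))

Both memory and training time grow linearly with the number of corrupted copies k; the approach studied in this paper instead marginalizes over the corrupting distribution, which amounts to taking k to infinity analytically.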
Though simple and effective, this explicit-corruption approach lacks elegance, and the computational cost of processing the additional corrupted examples could be prohibitive for many applications. Other work (Globerson and Roweis 2006; Dekel and Shamir 2008; Teo et al. 2008) adopts an adversarial worst-case analysis to improve the robustness of SVMs against feature deletion in test data. Though rigorous in theory, a worst-case scenario is unlikely to be encountered in practice. Moreover, the worst-case analysis usually results in solving a complex and computationally demanding problem.

In this paper, we show that it is efficient to train linear SVM predictors on an infinite number of corrupted copies of the training data by marginalizing out the corrupting distributions, an average-case analysis. We concentrate on dropout training, but the results are directly applicable to other noising models, such as Gaussian, Poisson, and Laplace (van der Maaten et al. 2013). For all these noising schemes, the resulting expected hinge loss can be upper-bounded by a variational objective obtained by introducing auxiliary variables, which follow a generalized inverse Gaussian distribution. We then develop an iteratively re-weighted least squares (IRLS) algorithm to minimize the variational bounds. At each iteration, our algorithm minimizes the expectation of a re-weighted quadratic loss under the given corrupting distribution, where the re-weights are computed in a simple closed form. We further apply similar ideas to develop a new IRLS algorithm for the dropout training of logistic regression, which extends the well-known IRLS algorithm for standard logistic regression (Hastie, Tibshirani, and Friedman 2009). Our IRLS algorithms shed light on the connection and difference between the hinge loss and the logistic loss in the context of dropout training, complementing previous analyses (Rosasco et al. 2004; Globerson et al. 2007) in supervised learning settings. Finally, empirical results on classification and a challenging "nightmare at test time" scenario (Globerson and Roweis 2006) demonstrate the effectiveness of our approaches in comparison with various strong competitors.

Preliminaries

We set up the problem in question and review learning with marginalized corrupted features.

Regularized loss minimization

Consider binary classification, where each training example is a pair $(\mathbf{x}, y)$ with $\mathbf{x} \in \mathbb{R}^D$ an input feature vector and $y \in \{+1, -1\}$ a binary label. Given a set of training data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^N$, supervised learning aims to find a function $f \in \mathcal{F}$ that maps each input to a label. To find the optimal candidate, one commonly solves a regularized loss minimization problem

$$\min_{f \in \mathcal{F}} \; \Omega(f) + 2c \cdot \mathcal{R}(\mathcal{D}; f), \qquad (1)$$

where $\mathcal{R}(\mathcal{D}; f)$ is the risk of applying $f$ to the training data, $\Omega(f)$ is a regularization term to control over-fitting, and $c$ is a non-negative regularization parameter. For linear models, the function $f$ is simply parameterized as $f(\mathbf{x}; \mathbf{w}, b) = \mathbf{w}^\top \mathbf{x} + b$, where $\mathbf{w}$ is the weight vector and $b$ is an offset. We denote $\boldsymbol{\theta} := \{\mathbf{w}, b\}$ for clarity. The regularization term can be any standard norm, e.g., the $\ell_2$-norm, $\Omega(\mathbf{w}) = \|\mathbf{w}\|_2$, or the $\ell_1$-norm, $\Omega(\mathbf{w}) = \|\mathbf{w}\|_1$. For the loss function, the most relevant measure is the training error, $\sum_{n=1}^N \delta(f(\mathbf{x}_n; \boldsymbol{\theta}) \neq y_n)$, where $\delta(\cdot)$ is the indicator function; this quantity, however, is not easy to optimize. A convex surrogate loss is used instead, which normally upper-bounds the training error. Two popular examples are the hinge loss and the logistic loss, recalled below.
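For reference, the standard unit-margin per-example forms of these two surrogate losses are shown below in the notation defined above; the paper's own definitions may include an additional margin or cost parameter.

$$\mathcal{R}_h(\mathcal{D};\boldsymbol{\theta}) = \sum_{n=1}^{N} \max\bigl(0,\; 1 - y_n f(\mathbf{x}_n;\boldsymbol{\theta})\bigr) \quad \text{(hinge loss)},$$

$$\mathcal{R}_\ell(\mathcal{D};\boldsymbol{\theta}) = \sum_{n=1}^{N} \log\bigl(1 + e^{-y_n f(\mathbf{x}_n;\boldsymbol{\theta})}\bigr) \quad \text{(logistic loss)}.$$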


Publication date: 2014